-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Add lance format support #7913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add lance format support #7913
Conversation
|
Mentioned #7863 as well |
|
@pdames for vis |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
Cool ! I notice the current implementation doesn't support streaming because of the symlink hack. I believe you can do something like this instead: def _generate_tables(self, paths: list[str]):
for path in paths:
ds = lance.dataset(path)
for frag_idx, fragment in enumerate(ds.get_fragments()):
for batch_idx, batch in enumerate(
fragment.to_batches(columns=self.config.columns, batch_size=self.config.batch_size)
):
table = pa.Table.from_batches([batch])
table = self._cast_table(table)
yield Key(frag_idx, batch_idx), tablenote that path can be a local one, but also a |
|
@lhoestq Take another look? |
|
I took the liberty to make a few changes :) Now I believe we should be good:
I think this PR is ready, just let me know what you think before we merge 🚀 The next steps are:
Feel free to start some drafts (I noticed there are great examples in your HF account now !), I'll be happy to review :) And once Lance is available in huggingface.js and docs are ready we'll be ready to enable the Dataset Viewer and Lance code snippets on HF ! |
| yield Key(frag_idx, batch_idx), self._cast_table(table) | ||
| else: | ||
| for file_idx, lance_file in enumerate(lance_files): | ||
| for batch_idx, batch in enumerate(lance_file.read_all(batch_size=self.config.batch_size).to_batches()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we support columns pushdown here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added it at LanceFileReader() initialization, since the argument is not available in read_all()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, how does it work with multiple data files within same fragment?
In Lance, one fragment can be 1 or more data files, where each data files cover a few columns. This is how we can add new features / column cheaply without rewriting the datasets, by adding new data files to existing fragment.
Maybe we can address it as follow up tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case it's a dataset no ? since it requires a manifest or something to tell what the fragments are made of
LanceFileReader() is only used for single files, i.e. that don't belong to a lance dataset directory or require manifest files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Lets 🚢 !!
Add lance format as one of the
packaged_modules.